Preface


This final report is being submitted by the Stanford University
Biomedical Applications Team (SUBAT) under NASA Cooperative
Agreement NCC 2-113. It presents an account of the activities and
results of SUBAT's "Mobility Aid for the Blind" technology
transfer project. The report has been prepared for NASA's
Technical Monitor Henry Lum, Ph.D., Chief of the Project
Technology Branch at NASA-Ames Research Center.

Although Cooperative Agreement NCC 2-113 defines its period of
performance as 1 December 1980 through 30 November 1981, SUBAT did
not receive its Notification of Award from NASA until 31 January
1981. The report, however, covers the entire period of performance
and includes work done by SUBAT in December 1980 and January 1981
in anticipation of award.

The work reported herein was done under the general supervision of
SUBAT Executive Director Donald C. Harrison, M.D. by consultants
J. M. Tenenbaum and Robert J. Deijs, collaborators Michael
Deering, Carter Collins, and Tim Healy, SUBAT Director Gary L.
Steinman, and SUBAT Secretary Marie Bellamy. The programming
assistance of Lynn Quam, the secretarial assistance of Marion
Hazen, and the generosity of SRI International in providing use of
their facilities are gratefully acknowledged.
Purpose of report. During 1981, the Stanford University
Biomedical Applications Team (SUBAT) collaborated with the
Smith-Kettlewell Institute of Visual Sciences (SKIVS) and the
Project Technology Branch of NASA-Ames Research Center (ARC) to
carry out the first stage of a technology transfer project
entitled "Mobility Aid for the Blind." This report presents
SUBAT's final accounting of the activities and results of this
collaborative effort.

Overview. SKIVS has been engaged for several years in a project
to develop an effective mobility aid for blind pedestrians
(i) which acquires consecutive images of the scenes before a
moving pedestrian, (ii) which locates and identifies the
pedestrian's path and potential obstacles in the path, (iii) which
presents path and obstacle information to the pedestrian, and (iv)
which operates in real-time. The mobility aid has three principal
components: an image acquisition system, an image interpretation
system, and an information presentation system. The image
acquisition system consists of a miniature, solid-state TV camera
which transforms the scene before the blind pedestrian into an
image which can be received by the image interpretation system.
The image interpretation system is implemented on a microprocessor
which has been programmed to execute real-time feature extraction
and scene analysis algorithms for locating and identifying the
pedestrian's path and potential obstacles. Identity and location
information is presented to the pedestrian by means of tactile
coding and machine-generated speech.

Objectives. The objective of the technology transfer project
which is imbedded in SKIVS' mobility aid program is to develop,
implement, and transfer the required feature extraction and scene
analysis software. The ultimate goal of the project is to
overcome limitations in the capacity of SKIVS' most recent
prototype mobility aid to "understand" typical urban sidewalk
scenes in real-time. The present study has been undertaken to
determine whether this goal can be achieved with existing image
interpretation algorithms and, if so, what must be done to
implement such algorithms for real-time execution on the mobility
aid.


Changes in scope of study. A sequence of events beyond
SUBAT's control has forced SUBAT to modify its approach to
achieving study goals, to reduce the scope of the study, and to
extend the study schedule beyond submission of this final report.
The three most serious events were (i) a two-month delay in NASA's
issuance of a Notification of Award for this study which reduced
SUBAT's period of performance from one year to ten months, (ii)
NASA's termination of SUBAT which forced the revision of Mobility
Aid for the Blind project plans to accommodate SUBAT's demise, and
(iii) NASA's failure to provide agreed upon image digitization and
computer processing services which effectively eliminated the
possibility of conducting image interpretation experiments. The
impact of these events will be discussed in later sections.
Background


Computer vision. The present image interpretation problem
falls within the scope of computer vision, a field of interest in
which artificial intelligence techniques are used to endow
computers with the capacity to perceive and understand images. In
practice, image interpretation schemes can perform with a degree
of logical sophistication that cannot be achieved with
conventional computer programming techniques. Such performance,
however, typically is confined to very strictly constrained image
domains for a given image interpretation system.


For the purposes of this report, the process of image
interpretation can be viewed as involving a model of an image
domain and two basic operations: feature extraction and semantic
analysis. Feature extraction consists of recovering certain
fundamental characteristics of an image or sequence of images
including, for example, edge locations and orientations, surface
textures and colors, light intensity levels, and contrast. The
delineation of such characteristics constitutes a computer's
description of an image which must be interpreted. An image
domain model is a detailed dynamic description of the visual
environment of interest. It includes as much information about
the possible status of the visual environment as is of
interest to the user, and it abstracts this information in terms
of the characteristics which can be recovered by feature
extraction from images of the environment. Semantic analysis is
the process by which the characteristics of a given image or image
sequence are related to the image domain model to produce an
interpretation about the status of the visual environment.
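
As a minimal illustration of the feature extraction step described
above, the following Python sketch recovers edge locations and
orientations from a small grayscale image using central-difference
gradients. The function name, the threshold value, and the toy image
are assumptions made for illustration only; they are not taken from
the mobility aid software.

```python
import math

def extract_edges(image, threshold=50.0):
    """Return (row, col, magnitude, orientation_degrees) for strong gradients.

    `image` is a list of equal-length rows of intensity values.
    Central differences approximate the intensity gradient at interior pixels.
    """
    edges = []
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0  # horizontal change
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0  # vertical change
            magnitude = math.hypot(gx, gy)
            if magnitude >= threshold:
                # Direction of the gradient; the edge runs perpendicular to it.
                orientation = math.degrees(math.atan2(gy, gx))
                edges.append((r, c, magnitude, orientation))
    return edges

# Toy scene (hypothetical): dark region on the left, bright on the right.
toy = [[0, 0, 0, 200, 200, 200] for _ in range(5)]
edges = extract_edges(toy)
```

In this toy scene the only strong gradients lie along the vertical
boundary between the dark and bright regions. A semantic analysis
stage would then relate such edge descriptions to the image domain
model.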

Computer vision in the mobility aid. After several years of
research and development work, SKIVS has produced a computer
vision mobility aid for blind pedestrians. The main goal in this
work has been to endow the aid with the capacity to locate
sidewalk position relative to the user and to warn the user of
objects which might block his path or present a hazard. The
current system, which is implemented on a 16 bit microprocessor,
achieves this goal under very limited conditions. Severe
constraints are imposed by the need to operate in real-time. The
constraints restrict the image domain, the algorithmic approach to
feature extraction and semantic analysis, and the level of
mobility aid performance as expressed by obstacle detection and
recognition rates for various classes of objects.

The image domain is restricted to clean, sun-lit, mostly
shadow-free sidewalks. Certain initial conditions, such as camera
orientation and sun position, must be supplied externally.
Performance specifications are expressed in probabilistic terms to
allow (i) the occasional occurrence of warnings about non-existent
objects (false positives) and (ii) the occurrence of detection
failures (false negatives), especially in respect to objects whose
size falls below the camera's Nyquist sampling rate, objects whose
contrast with respect to the background is low, or objects whose
images are viewed against a highly textured background. The need
to deal with essentially probabilistic performance specifications
is typical of state-of-the-art efforts in other real-time
applications of computer vision, including industrial parts
recognition, blood cell counting, and autonomous navigation.

The mobility aid relies on a crude but fast feature extraction
algorithm that operates at one video frame per second to detect
and link edges. More sophisticated feature extraction operations
cannot be done at the present time. Semantic analysis is
amortized over several frames so that predictions from previously
analyzed frames can be used to facilitate more rapid
interpretation of the current frame. The overall approach is
similar to approaches employed by Brooks and Shapiro. The
amortization of semantic analysis over multiple frames, in
particular, is an approach which has been used by Williams, Tsuji,
Roach, Nevatia, and Davis.

The SKIVS mobility aid can guide a blind pedestrian successfully
down the center of an urban sidewalk under these constraints. If
any of the constraints are violated, however, the system usually
fails. Small objects are never detected, for example, and shadows
usually confuse the system. The constraints reflect ideal but
unrealistic circumstances for mobility aid use. Practical use
requires the development of feature extraction and semantic
analysis software that can operate under more realistic
conditions.


Overcoming limitations. The limitations of SKIVS' mobility aid
and most other real-time computer vision systems arise from the
need to carry out feature extraction operations at video frame
rates on cost- and size-limited computational resources. This
requirement currently forces such systems to rely on crude feature
extraction algorithms which permit only simple edge detection and
linking and on equally simplified descriptions of image domains.
Reliable recognition, on the other hand, requires much more
sophisticated initial feature extraction as well as richer
descriptions of image domains. For example, if distance from
viewer, three-dimensional shape, relative orientation, surface
texture, and color were captured by the mobility aid, it would be
possible to distinguish shadows or flat objects on a sidewalk from
real obstacles and to recognize three-dimensional objects
regardless of viewpoint. In view of the fact that VLSI technology
is revolutionizing the capabilities of small, inexpensive
computers, it is appropriate to consider incorporating more
elaborate software in the mobility aid so as to achieve more
practical operational capabilities.


A variety of powerful computer vision techniques are potentially
applicable to the mobility aid. For example, Barrow and Tenenbaum
review techniques for recovering surface descriptions based on
stereo correspondence, optic flow, texture gradient, and other
basic cues. Other potentially applicable techniques include
(i) surface shape recovery from photometric shading, texture,
stereopsis, and motion flow; (ii) detection of long, straight
lines against complex backgrounds by means of the Hough transform;
and (iii) image partitioning by means of the recursive top-down
algorithm. Many of these techniques were developed in the context
of NASA-sponsored robotics and remote sensing research.
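
To make the Hough transform mentioned above concrete, the following
Python sketch votes in a (rho, theta) accumulator and reports the
best-supported straight line among a set of edge points. The report
names the technique but gives no implementation, so the function
name, the one-degree angular resolution, and the sample points are
illustrative assumptions.

```python
import math

def strongest_line(points, n_theta=180):
    """Vote in (rho, theta) space and return (votes, rho, theta_degrees)
    for the best-supported line rho = x*cos(theta) + y*sin(theta)."""
    accumulator = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            accumulator[(rho, t)] = accumulator.get((rho, t), 0) + 1
    (rho, t), votes = max(accumulator.items(), key=lambda item: item[1])
    return votes, rho, 180.0 * t / n_theta

# Hypothetical edge points: a long vertical line at x = 5 plus two stray points.
edge_points = [(5, y) for y in range(100)] + [(20, 3), (40, 77)]
votes, rho, theta_deg = strongest_line(edge_points)
```

Because every point on a line votes for the same accumulator cell,
the line is recovered even when stray edge points are present, which
is why the technique suits long, straight features such as sidewalk
edges seen against cluttered backgrounds.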

Approach

Initial approach. In general terms, the initial approach to
the first stage of this technology transfer project consisted of
(i) reviewing existing image interpretation techniques, (ii)
implementing potentially applicable techniques on a suitable
computer system, (iii) developing an urban sidewalk image
database, (iv) conducting image interpretation experiments to
determine which techniques produce the best results on the image
database, and (v) thereby identifying the image interpretation
resources and needs in connection with later stages of the
project.

Unfortunately, previously described events produced
unanticipated time and resource constraints which effectively
precluded carrying out any experiments. It became necessary to
make mid-project revisions of approach and to stretch resources by
soliciting services and materials from concerned individuals and
institutions.

Revised approach. A new approach was devised which takes
advantage of an almost universal desire among computer vision
experts to extend the application of their technology to new
visual environments and to obtain a consistent basis for comparing
image interpretation techniques. This approach involves (i)
compilation, digitization, and distribution of an urban sidewalk
image database to a variety of computer vision experts, (ii)
conduct of voluntary image interpretation experiments on the
database with each expert using his own facilities and software,
and (iii) assessment of the experimental results by the Mobility
Aid for the Blind project team to determine (a) the extent to
which available techniques can overcome present mobility aid
limitations and (b) which limitations of the mobility aid require
additional research to surmount.

Interest among computer vision experts in the Mobility Aid
for the Blind Project has been stimulated by two incentives.
Firstly, the database has been compiled in such a way that
it can serve as a universal standard for objective comparison of
alternative image interpretation algorithms. The field of
computer vision is in need of such a standard, and this project's
database will be of lasting benefit to the research community in
that role. Secondly, modest honoraria were offered for the most
effective techniques that are revealed by voluntary
experimentation.
Methods and Results

Overview. The essential accomplishment of the project team
to date has been the compilation, digitization, and distribution
of an urban sidewalk image database which many of the foremost
researchers in computer vision have agreed to process in their own
laboratories. A retrospective examination of this database
revealed a number of photometric quality problems which are being
corrected at the present time. Arrangements have been made to
complete the present study, and a voluntary follow-up report will
be prepared under the supervision of J. M. Tenenbaum. The report
will convey an analysis of the results of experiments conducted on
the project's image database by the various researchers who have
agreed to participate in the project.

Testing protocols, performance specifications, and scene
variables. The prototype mobility aid was tested under a protocol
which was defined in connection with the NSF-sponsored mobility
aid development project. A summary of the protocol and applicable
performance specifications is included as Appendix A to this
report.

An analysis of the test results revealed instances of scenes
which caused the mobility aid to fail and identified the scene
variables which were responsible for the failures: (i) size and
type of shadows on the sidewalk, (ii) degree of contrast between
adjacent regions of an image, e.g., between sidewalk and street or
lawn, (iii) the number of objects which clutter the scene, and
(iv) the incidence of "stepdowns" or holes in the sidewalk which
is being viewed. These variables were incorporated into the
design of the image database to provide for an effective
assessment of available image interpretation techniques.

Less formal reports of user experience were productive, also,
in establishing an approach to the assessment of mobility aid
performance. It became apparent (i) that it is necessary to
distinguish classes of objects that differ in relative importance
to a user's safe and efficient travel and (ii) that obstacle
detection and recognition rates must be interpreted in the context
of such a classification scheme. Basically, there are several key
items such as sidewalk edges, potholes, broken sidewalk slabs, and
large obstacles which must be detected. Also, there are secondary
items such as small off-path objects and distinguishing features
of large objects (e.g., car vs truck vs boulder) whose detection is
desirable but not essential. Finally, there are items such as
shadows, fallen leaves, and sidewalk stains which must be ignored
or suppressed by the mobility aid.

The image database of
representative urban sidewalk scenes was recorded photographically
using a 35 mm SLR still camera and a 16 mm movie camera. Each
scene was recorded as a stereo pair of 35 mm positive color slides
using a single tripod-mounted camera at head height. A single
camera was used to obtain sequential exposures of the halves
of each pair, and a horizontal rail mount was devised along
which the camera could be translated between exposures to achieve
stereo separation. This approach facilitated use of a single lens
for both halves of a stereo exposure and, therefore, avoided
variations in lens distortion effects between halves. A stereo
separation of 2.5 inches was chosen to simulate the human eye
system. Most slides were taken with a sixty degree field of view
to simulate normal human peripheral vision, and a few slides were
taken with a four degree field of view to simulate human foveal
vision. The narrower field images appear as close-ups of texture
and sidewalk cracks.

Selected scenes were recorded as 16 mm movies to obtain image
sequences of a sort that would be seen by a sighted pedestrian.
No attempt was made to produce stereo movies.

Database. Forty stereo slides were taken of Berkeley and San
Francisco sidewalk scenes. Twelve of these slides were selected
for use in the database, and eleven of these slides are reproduced
as photocopies of prints in Appendix B.

The twelve slides were digitized so that they could be
distributed in machine readable form. Since NASA-Ames Research
Center could not digitize the slides as had been previously
agreed, the project team arranged for no-cost digitization to be
done at SRI International. Software was written to speed up the
digitization process by permitting the 35 mm film strips (uncut
slides) to be wrapped around a drum, digitized en masse, and
partitioned into individual frames by means of software
manipulation. This technique produces uniformly digitized images
and simplifies the registration of stereo pairs. Since there is
no universal file format for computer images, a self-descriptive
format which is compatible with most vision systems was devised.
The scanning procedure and file format are described in Appendix
C.


Problems and corrective measures. Although the resulting
digitized imagery appeared acceptable on a raster display, its
overall quality was inadequate for use as a standard database.
Two specific problems became apparent: (i) The dynamic (light
intensity) range of the images exceeded the linear range of the
film's density curve because (a) fast film was used, (b) the film
was slightly underexposed, and (c) high contrast development
techniques were used. As a result, image detail in regions of
shadow and bright illumination was compromised, and the data
became unsuitable for interpretation by photometric techniques.
(ii) The stereo baseline was too small to produce sufficient range
resolution. Given the inherent spatial resolution of the
digitized images, for example, it is not possible to detect a
sidewalk curb at a distance of ten feet. An analysis is provided
in Appendix D.
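
The effect of a short stereo baseline on range resolution can be
illustrated with the standard pinhole stereo relation Z = f*B/d,
from which a disparity change of dd at range Z corresponds to a
range change of roughly Z^2 * dd / (f * B). The Python sketch below
is not the Appendix D analysis. It assumes that the 700-pixel image
width quoted in Appendix C spans the sixty degree field of view,
that the baseline is the 2.5 inches quoted earlier, and that
disparity can be measured to one pixel; the function names are
hypothetical.

```python
import math

def focal_length_pixels(image_width_px, fov_degrees):
    """Focal length in pixel units for a pinhole camera with the given
    horizontal field of view."""
    return (image_width_px / 2.0) / math.tan(math.radians(fov_degrees) / 2.0)

def range_step(distance, baseline, focal_px, disparity_step_px=1.0):
    """Approximate range change per disparity step: Z^2 * dd / (f * B).
    `distance` and `baseline` must share the same length unit."""
    return distance ** 2 * disparity_step_px / (focal_px * baseline)

f_px = focal_length_pixels(700, 60.0)     # assumed image geometry
step_in = range_step(120.0, 2.5, f_px)    # ten feet = 120 inches
step_wide = range_step(120.0, 5.0, f_px)  # hypothetical doubled baseline
```

Under these assumptions the focal length is about 606 pixels and a
one-pixel disparity change at ten feet corresponds to roughly nine
and a half inches of range, which is coarse for small sidewalk
features. Doubling the baseline halves that step, which is the
motivation for a wider baseline.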

A new set of slides is being taken to correct these problems
in the database. ASA-25 film and custom developing services are
being used to obtain better dynamic range and image quality. The
stereo baseline is being increased in width to improve range
resolution. Ultimately, the image interpretation software in the
mobility aid will be written to accept stereo image pairs with a
stereo baseline of 2.5 inches. For the present, however, the
extended stereo baseline is required to facilitate the use of
existing software for experimental purposes.

Solicitation of project participants. The participation of
computer vision researchers in the Mobility Aid for the Blind
Project was solicited by mail and by informal contacts at
professional meetings. A review of the computer vision technical
literature revealed several papers which describe image
interpretation techniques, algorithms, and systems of potential
relevance to the mobility aid problem. Each author was sent color
prints of the database images and a letter of invitation to
participate in the present study. The letter and its distribution
list are attached as Appendix E to this report. Additional
informal contacts were made by the project team at the 1981
International Joint Conference on Artificial Intelligence
(Deering) and the NSF Workshop on Human and Machine Perception
(Tenenbaum). The response to the project team's solicitation
efforts exceeded expectations. A list of the eighteen researchers
who have indicated a willingness to participate in the present
study is provided in Appendix F. The list includes many of the
foremost experts in the field of computer vision. Furthermore,
the specific interests of these experts span most of the mobility
aid's problem areas.

Presentations and papers. Michael Deering presented the paper
"Real-Time Natural Scene Analysis for a Blind Prosthesis" at the
1981 International Joint Conference on Artificial Intelligence in
Seattle, 24-28 August 1981. The paper is included in this report
as Appendix G. Also, Jay Tenenbaum delivered the Keynote Address
at the NSF Workshop on Human and Machine Perception in Denver,
1?-14 August.

Plans. The revised image database will be distributed to the
voluntary project participants who will conduct image
interpretation experiments. Experimental results will be reported
to the Mobility Aid for the Blind Project Team and analyzed. The
analysis will be reported to NASA-Ames Research Center (Lum) in a
voluntary follow-up report to be prepared by Tenenbaum and, if
warranted, in the open literature or at a professional conference.
An appropriate agency, probably NSF, will be selected as
permanent repository of the image database. Finally, the project
team will develop proposals to obtain funding for later stages of
the technology transfer project. Specific plans include proposing
(i) continuation of mobility aid development to NSF and ARPA and
(ii) VLSI implementation of image interpretation algorithms to
NASA. The latter activity will be proposed in the context of
robotics applications for orbiting space platforms.

Conclusions 

The first stage of the Mobility Aid for the Blind technology
transfer project has not been completed as planned due to a
sequence of events beyond SUBAT's control. However, the project
team has provided for the eventual completion of proposed work and
for the submission of a follow-up report once the work has been
completed. Also, the project team is prepared to draw several
conclusions from the work accomplished to date:


1. The Mobility Aid for the Blind project is perceived as
being worthwhile by the computer vision research community. Many
computer vision experts have agreed to participate in the project,
and feedback concerning the potential use of the urban sidewalk
database as a standard testbed for computer vision technology has
been positive.


2. It is anticipated that the twelve images in the urban
sidewalk database will exert substantial leverage on the further
development of computer vision and that the database will be of
lasting benefit to the computer vision research community. The
wide distribution of the "industrial parts bin" image database
several years ago had such an impact. The urban sidewalk database
is more complex and, for the first time, permits both (i)
comparative assessment of a variety of algorithms and approaches
on a single database and (ii) the assessment of simultaneously
applied complementary approaches and integrated approaches on a
complex standard database.


3. The methodology of distributing images rather than
distributing algorithms should provide many of the scientific
results that are supposed to be produced by the more ambitious
ARPA Image Understanding Testbed Project (IUT) at a fraction of
that project's cost.


4. The use of the mobility aid database as a standard
database will focus the research community's interest on the
mobility aid problem for years to come. The results of such
focused efforts should be a substantial improvement in the
performance, cost, and design of mobility aids and other vision
aids for the blind.


5. It is anticipated that no one algorithm will prove to be
superior in image interpretation performance on urban sidewalk
scenes. Different algorithms will be more successful at
recognizing different classes of obstacles in the blind
pedestrian's path. Ultimately, the mobility aid will incorporate
several different algorithms in order to achieve adequate
performance over a wide class of obstacles and conditions in the
visual environment. The present task is to ascertain which
existing algorithms and techniques should be incorporated into the
mobility aid and what new approaches must be developed and
incorporated.
EXCERPT OF 7 AUGUST 1980 LETTER FROM CARTER C. COLLINS TO GARY L. STEINMAN


In reference to the NASA proposal, we would tentatively formulate
performance specifications for the electronic mobility aid visual
processing algorithms as follows:

Class I: Detection of large, salient features in the environment
such as poles (as indicated in our original grant request to NSF) and,
in general, those obstacles relatively easy for the algorithms to
detect. These will include features over 1' in height or depth
including boxes, low walls, parked cars, bushes, etc. We would intend
to supply sample photographs of sidewalk scenes with such items clearly
labelled. Class I objects should be detectable at the 98% correct
level.

Class II objects would include such items as the edge of the 
sidewalk or path, edges of a lawn or wall, crosswalk markings and 
other irregular path delineators. Detection and recognition accuracy 
for Class II objects could be 95% or better. 

Class III objects will include items such as curbs (step down),
broken sidewalk slabs, rocks, small objects, litter, and large chuck
holes. In general Class III objects will be 4 inches or larger in
vertical extent. The detection and recognition accuracy for Class III
obstacles can vary over a range from 00% for those obstacles which are
at present most difficult to virtually 98% for those which we can
easily handle with our present algorithms. With further analysis we
will be able to tighten the specifications and break them into
subcategories, each with its own quoted accuracy of detection.

Class IV objects should include items down to 1/2” in height, 
such as a fractured sidewalk, broken tiles, litter and other small 
items on the sidewalk. The major use of Class IV is to reject items 
such as leaves and shadows which would not constitute an impediment 
or hazard to the blind pedestrian.

Our current microprocessor-based system should be able to handle the
first 3 categories at the stated accuracies under ideal conditions
which perhaps obtain some 70% of the time. However, the new algorithms
which we seek to utilize or develop under NASA funding should handle
all 4 classes under 90 to 95% of all lighting and weather situations
(except extreme darkness). Such all-weather performance will be
required in order to achieve acceptance by the blind community.

For our currently NSF funded project (Phase I) , we plan to evaluate 
the apparatus with blind users. An investigator, utilizing a visual 
scoring technique, will tabulate the classes of obstacles encountered
by button presses which will be recorded on a portable tape recorder. 
Results will be summarized and played back through a microcomputer to 
collate and analyze the performance and statistical data in all desired 
categories. 
APPENDIX C

IMAGE DIGITIZATION AND FILE FORMAT
Image Digitization Procedure

35 mm images were digitized using an Optronics O-IROO Color
Scanner at SRI International. Each strip was scanned at U 5
microns per pixel resolution four times using clear, red, green,
and blue color filters. Digitized pixel value is related to film
density and film transmission by the following formulas:

    pixel value = 255/3 * (3 - film density)
    pixel value = 255/3 * (3 - logbase10(transmission)),

where -3 <= density <= 0.

The boundaries of the individual frames were located using an
image scrolling program (SCROLL). Each frame in all four color
bands was extracted into images which are 700 pixels wide by 512
pixels high.

Image File Format

The digitized images are stored in files which consist of an
8-byte header followed by scan lines in left-to-right, top-to-bottom
order. Each header has exactly the same information:

    bytes 0,1    constant: 1100 decimal
    bytes 2,3    number of pixels per line    = 700
    bytes 4,5    number of lines per image    = 512
    bytes 6,7    number of bits per pixel
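The header layout above can be read with a few lines of code. The following is a minimal sketch only: the appendix does not state the byte order of the two-byte fields, so little-endian order is an assumption here, and the function name `parse_image_header` is hypothetical.

```python
import struct

# ASSUMPTION: each two-byte field is an unsigned little-endian integer.
# The appendix gives the field order but not the byte order.
HEADER_FORMAT = "<4H"
HEADER_SIZE = 8
HEADER_CONSTANT = 1100  # constant stored in bytes 0,1


def parse_image_header(header_bytes):
    """Decode the 8-byte header into a dictionary of its four fields."""
    if len(header_bytes) < HEADER_SIZE:
        raise ValueError("header must be at least 8 bytes long")
    constant, pixels_per_line, lines_per_image, bits_per_pixel = struct.unpack(
        HEADER_FORMAT, header_bytes[:HEADER_SIZE]
    )
    if constant != HEADER_CONSTANT:
        raise ValueError("bytes 0,1 do not hold the constant 1100")
    return {
        "pixels_per_line": pixels_per_line,
        "lines_per_image": lines_per_image,
        "bits_per_pixel": bits_per_pixel,
    }
```

If the files turn out to be big-endian, only `HEADER_FORMAT` needs to change (to `">4H"`).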


A total of 88 files were produced: four each for the twenty-two
35 mm frames. The files are named as follows: FRAMEA.W,
FRAMEA.R, FRAMEA.G, FRAMEA.B, ..., FRAMEV.W, FRAMEV.R, FRAMEV.G,
and FRAMEV.B. The suffix to the string "FRAME" indicates one of
22 alphabetical letters in the range A through V; and the file
name extensions .W, .R, .G, and .B refer to the clear (white),
red, green, and blue color filters, respectively.
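As a small illustration of the naming scheme, the sketch below enumerates the file names in the order the appendix lists them. The function name and its parameters are hypothetical conveniences, not part of the original software.

```python
import string


def frame_file_names(first="A", last="V", filters="WRGB"):
    """List FRAME<letter>.<filter> names for every frame letter and filter."""
    start = string.ascii_uppercase.index(first)
    stop = string.ascii_uppercase.index(last) + 1
    letters = string.ascii_uppercase[start:stop]
    # Four files per frame: clear (W), red (R), green (G), blue (B).
    return [f"FRAME{letter}.{flt}" for letter in letters for flt in filters]
```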


Effects of Viewpoint Separation on Stereopsis


Stereo imagery is characterized by several parameters. These include the
linear separation between the two cameras, termed the stereo baseline, the field
of view and resolution of the cameras, and the distance to the three dimensional
objects to be viewed. The stereo resolution of such a system can be defined in
terms of the minimum depth separation between two objects that is large
enough to be detectable by the system. Within the sidewalk environment the
ability of a system to detect certain impediments depends upon the stereo
resolution. For example, detecting step downs, holes, and broken sidewalk slabs
depends upon the system being able to detect depth discontinuities in the
normally flat sidewalk surface. The difference in distance from the cameras to the
edge of a hole and to the bottom of a hole must be greater than the system's
stereo resolution if the hole is to be detected. For a desired depth resolution
Δd, the minimum required stereo baseline b can be determined in terms of the
other parameters.

Let the field of view of the camera be denoted by the angle φ. The monocular
resolution of a camera is defined in terms of the number of pixels in the
image. The camera can only detect features two pixels or larger in size (this is
the Nyquist sample size). Given a camera image of n by n pixels, for two points
to be distinguished as separate they must lie further apart than s =
4*f*tan(φ)/n (where f is the focal length of the camera and s is the distance on
the focal plane, both in the same units, inches or centimeters). Δd will denote
the minimum incremental distance beyond a point at distance d from the
cameras that can be resolved. The figure displays the camera geometry of objects
at distances d and d + Δd.
The point at distance d from the right camera will appear at a location u on
the image plane of the left camera given by the equation u = b*f/d. The point at
distance d + Δd will appear at u' = b*f/(d+Δd). In the image plane of the right
camera there will be no difference between the projection of the two points.
Thus for the system to distinguish between an object lying at a distance d and
one lying at a distance d + Δd, the quantity u - u' must be greater than the
detectable point separation distance s from above. This leads to the equations


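\[
u - u' = \frac{b \, f \, \Delta d}{d \, (d + \Delta d)} > \frac{4 \, f \tan(\varphi)}{n}
\]

Solving for b:

\[
b > \frac{4 \, d \tan(\varphi) \, (d + \Delta d)}{n \, \Delta d}
\]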


This equation gives the minimum camera baseline required for a desired
stereo resolution. From this it can be seen that higher stereo resolution is
obtainable at the cost of a longer baseline, higher resolution, narrower field of
view, or lesser distance to the objects of interest. For example, given a field of
view of 60 degrees, a camera resolution of 512 by 512 pixels, and an object
distance of 15 feet, a baseline of nearly 21 inches would be required to obtain a
stereo resolution of two feet. Because of perspective effects, a two foot stereo
discontinuity is produced by objects eight inches or taller on the sidewalk, and
thus such a baseline suffices to detect most sidewalk curbs.
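The worked example can be checked numerically. The sketch below simply evaluates the baseline inequality as printed, with the tangent taken of the full field-of-view angle, which is the reading that reproduces the quoted figure of nearly 21 inches. The function name and argument names are illustrative assumptions.

```python
import math


def minimum_baseline(distance, depth_resolution, field_of_view_deg, pixels):
    """Smallest baseline b satisfying b > 4*d*tan(fov)*(d + dd) / (n*dd).

    distance and depth_resolution share one unit; the result is in that unit.
    """
    tangent = math.tan(math.radians(field_of_view_deg))
    numerator = 4.0 * distance * tangent * (distance + depth_resolution)
    return numerator / (pixels * depth_resolution)


# Example from the text: 60 degree field of view, 512 pixels, 15 ft, 2 ft.
baseline_feet = minimum_baseline(15.0, 2.0, 60.0, 512)
baseline_inches = 12.0 * baseline_feet  # roughly 20.7 inches
```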
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LETTER TO 


AI EXPERTS 


FAIRCHIU9 

A SchlumbcrQtr Company 


TMtm»h§y Oravp 

4001 Mirandi A¥§nu§ 

P»lo AHo, CMliiorni* 94304 
T9l§phon9 415:493^3100 
TWH 910/370^5435 


July 6, 1901 


Dear Colleague (s) , 

We seek your collaboration In a coimnunity-wide study to 
identify useful vision algorithms for incorporation in a blind 
mobility aid. This effort, sponsored by NASA (through the Biomedical
Applications Team at Stanford University), is being undertaken to assess
whether a useful prosthesis can be constructed with state-of-the-art
techniques and to ascertain where additional research is most needed.

Specifically, we are asking participants to evaluate empirically the 
performance of their most relevant, operational algorithms on a 
database of representative images. 

Urban street scenes have been chosen as the experimental domain
(see enclosed sample photos). A useful mobility aid must be able to
identify clear navigable paths (e.g., the sidewalk) and detect the
presence of significant obstacles (e.g., curbs, chuck holes, telephone
poles, garbage cans, buildings and vehicles, as opposed to shadows).
Semantic labeling of obstacles is desirable but not essential.

The enclosed paper provides further background on the difficult visual 
processing problems that arise in building a mobility aid, and 
describes a first attempt at an integrated solution. We are 
interested in better algorithms for accomplishing various stages of 
processing within the context of that system, as well as algorithms 
motivated by research in computational vision, that could overcome 
fundamental limitations. Within this broad charter, many types of 
algorithms are potentially relevant: for example, correlation tracking 
(e.g., for following the center line of the sidewalk), segmentation 
(e.g., for extracting long lines, smooth curves, and homogeneous 
regions, associated with the sidewalk and major obstacles), 
interpretation of pictorial features as physical scene events (e.g., 
as shadow or occlusion boundaries), recovery of intrinsic surface 
characteristics (e.g., range and orientation mapping using shading, 
texture gradient, stereo, motion stereo, etc.), extraction of 
prominent three dimensional surfaces (e.g., the ground plane, vertical 
surfaces) and volumes (e.g., cylinder representations of obstacles 
such as telephone poles and cars), semantic or schema-based
interpretation of pictorial or 3-D features, establishing symbolic 
correspondence between features in successive views and so forth. 
We have compiled a modest image data base for use in evaluation, some
samples of which are enclosed. All scenes are available in color and
stereo. In some cases, foveal views of interesting features and time
sequences of imagery (in the form of 16mm movie film shot while
walking along the sidewalk) are also available. This data will be
available both in hard copy (as color prints or positive
transparencies) and digitized form (via tape or ARPANET). The preferred
form of output is graphical overlays (e.g., delineations of object
boundaries) superimposed on the original image. Other forms of output
(e.g., numeric arrays) may also be provided to permit more detailed
evaluation of algorithm performance.


We hope you will be able to participate in this worthwhile cause. In so
doing, you will also be contributing to the quality of computer vision
research by helping establish a precedent for competitive algorithm 
evaluation on standard data sets. As an added incentive, a small 
honorarium ($150) will be awarded for each of the 20 most promising 
algorithms. More substantial consulting funds may also be available 
for refinement and/or further evaluation. 


Please return the enclosed questionnaire as soon as possible. We plan
to begin distributing imagery immediately on receipt, and would like 
to receive your results for compilation by the end of the summer. The 
results will be disseminated as a report and, if appropriate, at some 
suitable conference. 


Sincerely,


J. M. Tenenbaum
Fairchild AI Research Laboratory

and

Michael Deering
University of California, Berkeley
Smith-Kettlewell Institute of
Vision Research


P.S., Recipients of this letter and some algorithms of interest are 
listed on the next page. Suggestions for other participants and 
algorithms relevant to this study would be greatly appreciated. 
Partial list of potential participants and potentially relevant
algorithms:


Harlyn Baker        (edge-based stereo)
Dana Ballard        (Hough techniques)
Ruzena Bajcsy       (texture gradient)
Steve Barnard       (vertical surface finder)
Tom Binford         (edge detection and interpretation)
Rod Brooks          (ribbon finder, model-based object finding)
Dave Burr           (image registration)
Bob Cunningham      (motion tracking)
Larry Davis         (segmentation, texture algorithms)
Martin Fischler     (linear features, analysis of range data)
Don Gennery         (stereo ground plane finder)
Eric Grimson        (stereo)
Marsha Jo Hannah    (bootstrap stereo)
Bob Haralick        (facet model)
Ellen Hildreth      (primal sketch)
Berthold Horn       (shape from shading + contour)
Takeo Kanade        (range finder, texture)
John Kender         (texture)
Martin Levine       (segmentation algorithms)
David Lowe          (surface interpretation)
Worthy Martin       (motion)
Dave Milgram        (interpretation of range data)
Hans Moravec        (stereo, motion, obstacle avoidance)
M. Nagao            (segmentation)
H-H Nagel           (analysis of image sequences)
Ram Nevatia         (edge detection, motion stereo)
Yu-ichi Ohta        (region analysis)
Sandy Pentland      (shape from shading)
Walt Perkins        (concurves)
Slava Prazdny       (range, orientation from motion)
Keith Price         (region extraction)
Lynn Quam           (correlation tracker)
Raj Reddy           (region extraction)
Ed Riseman          (edge and region extraction techniques)
Azriel Rosenfeld    (relaxation enhancement)
Jay Tenenbaum       (interpretation-guided segmentation)
Shimon Ullman       (shape from motion)
Jon Webb            (motion)
Tom Williams        (analysis of image sequences)
Andrew Witkin       (occluding edges, shape from texture)
Steve Zucker        (segmentation algorithms)
Name:

Address:

Description of algorithms: 

(please answer for each algorithm you plan to evaluate, 
using additional pages as required) 

Type of algorithm (edge detection, stereo matching etc.)

Input (image, line drawing, range array, cylinder model etc.) 

Output (image overlay, edge or region data structure, orientation 
array etc.) 


Principles of operation (brief description) 


Envisaged role in a mobility aid. 


Implementation Details 

(hardware, software, memory requirements, execution speed) 


What types of experimental imagery will you need from us? 

(Type of scene - refer to numbers on back of sample photos;
color, stereo, field of view; resolution and format of digitized
data, if desired, else size and nature, e.g., transparency, glossy,
of hard copy data.)
In what form will you supply output for evaluation?

(photograph of displayed results overlaid on original image, 
printout, tape dump of data structures etc.) 


Note: Transfer of imagery and results in machine-readable
form is preferred. Digitized imagery will be provided
as files, one record per row of the image array.

The same format can be used to return results
as arrays of labeled pixels. Such arrays
provide a uniform means for representing results at
many levels of processing (e.g., range values, edge or
region labels, semantic labels and so forth). They
are easy to display and compare, and are readily
transformed into other data structures. Photos of
graphical output are also acceptable, and would
be appreciated, in any case, to permit an initial,
qualitative evaluation.


Comments 

(suggestions for other participants, general comments on 
running this study) 


[Table: Experimental Algorithms: Mobility Aid for the Blind]

APPENDIX G 




Real-Time Natural Scene Analysis for a Blind Prosthesis 


Michael Deering

Computer Science Division, Department of EECS
University of California, Berkeley
Berkeley, California 94720


Carter Collins

Smith-Kettlewell Institute of Visual Sciences
San Francisco, California 94115


• This work was supported by NSF grant no. PFR-7908299 from the Science and 
Technology to Aid the Handicapped Program. 


ABSTRACT 

A real-time computer vision system designed for the limited environment of
city sidewalks is presented. This system is part of a prototype mobility aid for
the blind. The overall device endeavors to keep blind pedestrians on a safe path
down the sidewalk, and also warn of upcoming obstacles. The scene analysis
algorithm uses semantic models of the environment to interpret edges in the
multi-frame image data as borders of various objects, as well as to assign
distance estimates to these objects. The input is a 64 by 64 by 6 bit gray-scale
image taken from the vantage point of the shoulder of a pedestrian once a
second. Along with each image, the three dimensional transformation of the
camera location since the previous frame is assumed to be provided by
hardware. After an initial segmentation into edge lines represented as arcs of
circles, predictions of edges (generated by analysis of previous frames) are used
to identify edges in the current frame. Edges not identified by this process are
incorporated into the portion of the three dimensional world model that they
are the most consistent with. The induced three dimensional world model of
objects can then be used to provide mobility information to the blind user. The
emphasis throughout the system has been on efficiency. The design trade-offs
and techniques used to obtain high processing rates are discussed. Most of the
vision system is currently running in real-time on a 16 bit microprocessor.
Field trials of the complete prototype device will begin soon.

I Introduction 

An effort to produce an optically based electronic mobility aid for blind
pedestrians has led to the development of a natural scene analysis program for
the typical scenes encountered by a pedestrian. The restriction to the
semantically rich domain of city sidewalks has allowed the visual processing to be
performed in real time on a 16 bit microprocessor. The nature of the task is such
that perfect object detection and recognition are not required, but rather the
probabilistic detection of potential obstacles. The main goal of the system is to
determine approximately where the sidewalk is (with respect to the user), and
secondarily to warn of objects blocking the path ahead. This task must be
performed in real-time on real-world sidewalks.

Most real-time vision systems to date have dealt with very constrained
image domains, due to the enormous computational requirements inherent in
visual processing. These include industrial parts recognition [1], blood cell
counting, and automatic navigation. Faster hardware and more efficient
software techniques will gradually allow more complex domains to be handled at
high speeds. Our approach has been to start with very fast segmentation, and
amortize semantic processing over several frames, utilizing predictions from
models of previously recognized objects to guide the parse of the current image.
At the high end, our system is similar to many semantically oriented systems,
such as [2] and [3]. Finally our use of multi-frame data is similar to many
aspects of [4][5][6][7][8].


II Constraints 

In order to achieve real-time processing, we have had to impose several 
constraints upon the operations of the system: 

1. Input is restricted to clean, sun-lit, mostly shadow-free sidewalks.

2. Certain initial starting conditions will be supplied to the system from the
outside (which way the camera is pointing, where the sun is, etc.)

3. False positives are allowed (it is OK to occasionally warn about non-existent
obstacles)

4. We must accept that within our resolution and processing time certain
classes of objects are undetectable. These include objects whose width falls
below the Nyquist sampling rate of the camera (mainly skinny poles), and
objects with very low contrast with respect to the background, or those
against a wildly changing background.

At the same time, we wished to construct the program modularly, allowing
knowledge about objects and scenes to be separated from the control
structure (but without sacrificing efficiency).

III Overview of the Program

The input scenes are successive 64 by 64 by 6 bit gray-scale wide angle images
taken from the vantage point of the shoulder of a pedestrian, at a rate of one or
two frames per second. This relatively low resolution is the highest possible
under the hardware and timing constraints. The overall organization of the
program is: after video acquisition of the input scene, digitization and
noise removal, the information is processed in three passes as follows:

1. segment the picture into linked chains of edges, 

2. fit curves to these chains and put the mathematical description of the 
curves into an associative data-base, and 

3. match these curves against several data bases (the world model) which 
include curve predictions from previous scenes. 

The results of these matches identify semantically the objects belonging to
the curves. Knowing whether an object is horizontal or vertical allows one to
project the curves out into three dimensional space to determine their direction
and range. Further heuristics are employed that utilize location information
from previous frames to make an independent motion stereo based estimate of
the object ranges. At this stage the program should know where the sidewalk
and any close obstacles are located, and can proceed to output this information.
(Obstacles are the common sorts of large physical objects that one might
encounter on a city sidewalk: phone poles, lamp posts, fire hydrants, trash cans,
sign posts, automobiles (off to one side), parking meters, trees, bushes, etc.)

IV Coordinate System

The coordinate system used for the three dimensional outside world is
centered on the focal point of the camera as it moves through space. The Z-axis is
oriented in the direction that the camera is pointing, the Y-axis points straight
up, and the X-axis points to the right of the camera. Thus the three dimensional
locations of all objects are always determined relative to the pedestrian, and are
re-computed each frame. Points in the world are mapped to points in the image
plane through the usual projective geometrical equations.

One of the fundamental problems of computer vision is that this projection
of the three dimensional world (X,Y,Z) to the image plane (x,y) cannot be
reversed without additional data of some kind. One of the main goals of the
semantic phase of our system is to provide this additional data via semantic
knowledge about the probable locations, orientations, and relationships between
typical objects encountered within the sidewalk environment. This additional
information usually is in the form of a hypothesis on the value of one of X, Y, or Z.
Given this value, along with the image plane feature location (x,y), the remaining
two three-dimensional coordinates can be found by suitable manipulation of the
projection equations.
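To make the last step concrete, the sketch below inverts the projection once one world coordinate is hypothesized. It assumes the standard pinhole form x = f*X/Z, y = f*Y/Z, which the paper refers to only as "the usual projective geometrical equations"; the function name, argument names, and the camera height in the example are hypothetical.

```python
def back_project(x, y, focal_length, known_axis, known_value):
    """Recover (X, Y, Z) from image point (x, y) plus one known coordinate.

    ASSUMPTION: pinhole projection x = f*X/Z and y = f*Y/Z in camera axes
    (X to the right, Y up, Z along the viewing direction).
    """
    if known_axis == "Z":
        depth = known_value
    elif known_axis == "Y":
        if y == 0:
            raise ValueError("y = 0 gives no depth information from Y")
        depth = focal_length * known_value / y
    elif known_axis == "X":
        if x == 0:
            raise ValueError("x = 0 gives no depth information from X")
        depth = focal_length * known_value / x
    else:
        raise ValueError("known_axis must be 'X', 'Y', or 'Z'")
    return (x * depth / focal_length, y * depth / focal_length, depth)


# Hypothesis: the feature lies on the ground, 5 feet below the camera.
ground_point = back_project(0.1, -0.25, 1.0, "Y", -5.0)
```

In this example the hypothesis Y = -5 places the feature 20 feet ahead and 2 feet to the right of the camera.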

The motion of the camera in the world between frames will cause the
projections of edges of three dimensional objects onto the image plane to change.
The general case of the camera transformation involves six parameters,
(ΔX, ΔY, ΔZ, θ, φ, ρ). For a camera mounted upon the shoulder of a pedestrian it
can be safely assumed that ΔY and ρ are approximately zero, as there is little
torsional rotation, and the height of a particular pedestrian remains roughly
constant. (It should be noted that the mechanics of the human visual system
go to a great deal of effort to keep ρ near 0: a ρ rotation of up to six degrees of
the head is countered by an opposite rotation of the eyeball in the socket. The
shape of the horopter in many animals indicates that the height of the animal is
taken to be a constant for some visual processing.) The equations to perform
this transformation can be combined with the image plane projection equations
to obtain the location within the current image plane of a world point from a
previous frame.
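A reduced version of this prediction step can be sketched as follows. It is an illustration under stated assumptions, not the authors' equations: ΔY and the torsional rotation are taken as zero as the text suggests, the tilt rotation is also dropped for brevity (the paper does not say this), only a pan angle about the Y-axis is kept, and the sign conventions and function name are hypothetical.

```python
import math


def predict_image_point(world_point, delta_x, delta_z, pan, focal_length):
    """Project a point known in the previous camera frame into the new image.

    ASSUMPTIONS: the camera translates by (delta_x, 0, delta_z) and then pans
    by `pan` radians about the Y-axis (positive pan turns toward +X);
    projection is the pinhole form x = f*X/Z, y = f*Y/Z.
    Returns None when the point ends up behind the camera.
    """
    X, Y, Z = world_point
    # Express the point relative to the new camera position.
    xt, zt = X - delta_x, Z - delta_z
    # Rotate into the new camera orientation.
    xn = xt * math.cos(pan) - zt * math.sin(pan)
    zn = xt * math.sin(pan) + zt * math.cos(pan)
    if zn <= 0:
        return None
    return (focal_length * xn / zn, focal_length * Y / zn)
```

For example, a ground point 10 feet ahead and 5 feet below the camera moves from y = -0.5 to y = -0.625 (with f = 1) after the pedestrian steps 2 feet forward.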


V The World Model (the Sidewalk World) 

Our world model is the typical sidewalk environment as encountered by a 
pedestrian. Most modern sidewalks are constructed of slabs of white concrete, 
and are three to twelve feet wide. Many run in straight lines for an entire block 
before ending in a corner, while others may be curved. For simplicity it is
assumed that all sidewalks encountered by the vision system are straight for 
thirty feet beyond the camera unless a corner is ahead. Sidewalks mainly differ 
in their width and the presence (or absence) of a grass border on their street 
side. These variations are modeled by a few simple parameters. 

Within the 64 x 64 image the borders of the sidewalk on either side will often
appear as high contrast lines. These lines will be in the lower half of the image,
at a highly inclined angle. Various objects bordering on the sidewalk sometimes
are of similar optical intensity, reducing the contrast of the sidewalk edges. A
large variety of objects can border city and suburban sidewalks. These include:
bushes, shrubs, trees, grass, dirt, driveways, walkways, walls of all types, doors,
and windows. On the street side usually one finds pavement and automobiles.
These objects occur at fairly predictable locations, and many times with good
contrast compared to the white sidewalk.

Most objects located upon the sidewalk proper have three fortunate
properties: they do not move, have stereotypical locations with respect to the edge of
the sidewalk, and usually do not block the path. Objects in this class include:
phone poles, lamp posts, fire hydrants, traffic signs, parking meters, trees, mail
boxes, phone booths, most trash cans, and bushes. Many of these objects also
have the property of being rectilinear, and approximatable as cylinders or boxes
(and thus produce good, high contrast edges).

Unfortunately other objects are more unpredictable. These include: paper
boxes, bags, newspapers, trash, garbage cans, parked bicycles, and badly
parked cars. Such obstacles can appear anywhere on the sidewalk, and are not
always very rectilinear. However, they do have some properties that facilitate
their detection. Many are short, so their borders are within the two edges of the
sidewalk. They also rest directly upon the ground, enabling their distance to be
accurately determined and verified over several frames.

Finally there is the class of moving objects, which are the hardest to handle,
as their shape and location may change drastically from frame to frame. This
class includes: pedestrians, dogs, bicycles, occasionally a car crossing the
sidewalk, and wind-blown trash. Fortunately, most mobile obstacles are alive or
controlled by humans, and will try not to collide with a pedestrian.

VI Low Level Processing 


Initial segmentation of digital images is now a fairly well developed art, but
no general technique is known to be (close to) optimal for the large class of
natural images, and the speed of various algorithms can differ by orders of
magnitude. The necessity for real-time performance is the most severe constraint
imposed upon our system. Many promising segmentation techniques had to be
rejected out of hand on efficiency grounds.

The speed of the initial segmentation algorithm dominates the performance
of the overall system, as the semantic phase is usually many times faster [1].
Thus a poor quality (but fast) segmentation algorithm may be preferable to a
higher quality (but slower) algorithm if one can make the semantic phase work a
little harder. This is the case in our system. Our segmentation algorithm only
directly compares two pixels at a time, and thus is sensitive to noise, but runs at
a very high speed. In some sense every module in the system after the pixel
comparison has some component whose job is to help correct for the defects
introduced by the initial fast segmentation.

The segmentation algorithm used is an edge following algorithm that differs
from the usual (such as described in [9]), in that we follow several edges
simultaneously. Most edge followers grow an edge line point by point serially from
one end of the line. We instead grow many edge lines in parallel by adding to
both ends of many edge lines simultaneously. The advantages of this method
correspond roughly to those gained by a breadth first versus a depth first
search, in that there is more global information available when one is forced to
make local decisions. This allows edge thinning to take place at the same time
as edge following, and contributes to the speed of the algorithm.
In more detail, edge points in the input are found using a pseudo-random
scan [10]. In the area around each point a search is made for existing edge lines
and other isolated edge points. Based on complex decision rules, an existing
edge line may be extended to the new point, a new edge line may be created
between the new point and another point, or two edge lines may be joined
through the new point. (These rules are similar to those found in [11], but with
less complex weightings.) The decision rules mentioned above help thin the
edges, mainly by forcing edge lines to be essentially continuous. The final result
of pass 1 is the collection of edge lines obtained after all the edge points have
been processed.
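The paper does not describe the scan of [10], so the following is only one plausible way to visit every pixel of a 64 by 64 image exactly once in a scattered order. It uses a full-period linear congruential generator, which is a swapped-in technique chosen for illustration; the multiplier, increment, and function name are assumptions.

```python
def pseudo_random_scan(size=64, multiplier=1365, increment=1, seed=0):
    """Yield every (row, column) of a size-by-size image once, in scattered order.

    ASSUMPTION: a linear congruential generator stands in for the scan of [10].
    With size*size a power of two, an odd increment and a multiplier of the
    form 4k+1 give the full period, so no pixel is skipped or repeated.
    """
    total = size * size
    state = seed % total
    for _ in range(total):
        yield divmod(state, size)  # (row, column) of the pixel to examine
        state = (multiplier * state + increment) % total
```

Because neighbouring pixels are reached at widely separated times, edge lines can start growing in many parts of the image at once, which fits the parallel edge following described above.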

The edge follower tends to be conservative, as it knows that pass 3 will
connect broken lines. This is possible as pass 3 has access to more global
information about the objects within the scene, and thus may have reason to believe
that three roughly collinear line segments may in fact be the edge of one object.
Pass 1 does not have enough information to decide if a gap between two line
segments is due to noise breaking up a single edge, or is really an occluding object
or a gap between two objects.

In our system, noise in a single pixel many times can lead to the break-up
of a potential edge line into two pieces. The defects in this edge segmentation
can be modeled as higher level noise, that is, as broken edges, missing edges,
and misoriented edges. Semantic rules about line segments can take these
defects into account, and sometimes even make use of certain properties of the
"noise". For example, the breakup of an edge line corresponding to the edge of
an object in the scene may be caused by a surface discoloration near the object
edge. If this is the case, then the "noise" will be serially correlated from frame
to frame. Thus the broken edge will be broken in the same way in several
frames, and the relative location of the break can be (and is) used as a feature
of that edge (which can help to re-identify it in successive frames).


VII The Fitting of Edges to Curves


Pass 2 gathers information about each edge line, summarizes its attributes,
and sorts it to permit quick searches. Pass 1 sorts the edge lines by x-y location
and computes their length. Pass 2 computes their "curvature" and angle of
inclination from the horizontal, and sorts them by angle. Edge lines that are too
convoluted are broken up into smaller (and simpler) segments. This covers the
few cases in which the conservative edge follower described above is not
sufficiently conservative. This can occur when two straight line segments
intersect at a shallow angle. Pass 1 cannot distinguish this case from a single
shallowly curved line segment. Pass 2's curvature statistics are needed to resolve
this case.

We devised a fast algorithm to fit lines to arcs of circles. The main point for
our application was to classify a given line segment, within a millisecond, as a
roughly straight line, a gently curving line, or noise. It is possible that in the
future we may make use of finer details of the curves, but in an environment of
rotating three dimensional objects absolute curvature is of little use, and more
general information on object surface orientation and distance is needed (such
as discussed in [12]).
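
The three-way classification can be illustrated with a small Python sketch.
The report does not give the fitting algorithm itself, so the method below is
a swapped-in assumption: a total-least-squares line fit followed by an
algebraic least-squares circle fit. The function names and the 1-pixel
tolerance are likewise hypothetical.

```python
import math


def _centered(points):
    """Return the centroid and the points expressed relative to it."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return mx, my, [(x - mx, y - my) for x, y in points]


def line_rms(points):
    """RMS perpendicular distance of the points from their best-fit line."""
    _, _, c = _centered(points)
    n = len(c)
    sxx = sum(u * u for u, _ in c) / n
    syy = sum(v * v for _, v in c) / n
    sxy = sum(u * v for u, v in c) / n
    # Smallest eigenvalue of the 2x2 covariance matrix.
    lam_min = (sxx + syy) / 2.0 - math.hypot((sxx - syy) / 2.0, sxy)
    return math.sqrt(max(lam_min, 0.0))


def fit_circle(points):
    """Algebraic least-squares circle fit; returns (cx, cy, r) or None."""
    mx, my, c = _centered(points)
    suu = sum(u * u for u, _ in c)
    svv = sum(v * v for _, v in c)
    suv = sum(u * v for u, v in c)
    bu = 0.5 * sum(u * u * u + u * v * v for u, v in c)
    bv = 0.5 * sum(v * v * v + v * u * u for u, v in c)
    det = suu * svv - suv * suv
    if abs(det) < 1e-9:          # collinear or too few points
        return None
    uc = (bu * svv - bv * suv) / det
    vc = (bv * suu - bu * suv) / det
    r = math.sqrt(uc * uc + vc * vc + (suu + svv) / len(c))
    return mx + uc, my + vc, r


def circle_rms(points, circle):
    """RMS radial distance of the points from the fitted circle."""
    cx, cy, r = circle
    sq = [(math.hypot(x - cx, y - cy) - r) ** 2 for x, y in points]
    return math.sqrt(sum(sq) / len(sq))


def classify_segment(points, tol=1.0):
    """Label a pixel chain as 'straight', 'curved', or 'noise' (tol in pixels)."""
    if line_rms(points) <= tol:
        return "straight"
    circle = fit_circle(points)
    if circle is not None and circle_rms(points, circle) <= tol:
        return "curved"
    return "noise"
```

For example, classify_segment applied to points sampled along a straight line
returns "straight", while points sampled along a quarter circle of radius 50
return "curved".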

VIII Representation of Objects

Our three dimensional representation necessarily emphasizes the edges of
objects, given the nature of our initial segmentation. The representation
roughly resembles a collection of three dimensional edges of the object, but the
locations of the edge lines relative to each other are not fixed as in 3D wire
frame models in computer graphics, but rather are allowed to vary as needed. As
most objects in the sidewalk world are somewhat rectilinear, many are
represented by planes parallel to the X, Y, or Z axis (a similar representation
was used in [4]). For example, a phone pole can be fairly well approximated by a
rectangle facing the observer. The sidewalk proper is approximated by a
rectangular slab whose position and width are updated every frame with new data.

To achieve high processing speeds, some of our representation is procedural
rather than semantic. But as we have built up the number of objects
that we handle, a number of common subroutines have emerged, allowing new
objects to be added and represented fairly easily. High processing speed
and separation of control structures from knowledge are not necessarily
incompatible, but to obtain both one must have an intervening software step that
transforms high level abstractions into a form combinable with control
structures. It also helps to have a very flexible control structure. Our system
puts both edge data and object building procedures into associative data-bases,
thus allowing the flow of control to be determined by the data. In retrospect,
most of the object handling procedures could have been generated by machine from
static descriptors rather than by hand, and we may go to such a system in the
future.

IX Representation of Visual Knowledge

With the knowledge of how to recognize and represent objects handled by
the object representation, the remaining visual knowledge of interest is that
which tells you where an object is (its distance and direction). A number of
heuristics of varying degrees of generality exist to do this job, with varying
degrees of accuracy and constraints. These are:

1. If the object is known to be resting on (or very near) the ground plane, then
Y is known to be -UserHeight, and X and Z can be obtained by back
projection.

2. If the object has a known distance from the edge of the sidewalk (for
example, phone poles are usually one foot in), and all one has is a piece of an
edge of the object (usually not the ground plane intersection), then one can
obtain the object's (X,Y,Z) location as follows: take the image plane (x,y)
location of the edge piece, project it as a line through the origin (the focal
point of the camera) into (X,Y,Z) space, and intersect this line with the
plane which is the constant distance in from the sidewalk. The X and Z of this
intersection will be the object's location on the ground plane. (The equation of
the plane parallel to the sidewalk is obtainable because the equation of the
sidewalk edge is assumed to be known.) Even if the constant distance in
from the sidewalk edge is incorrectly guessed, which can lead to distance
errors on the order of 50% or more, at least some distance information has
been provided, and one can make simple decisions like "will I run into it in
two seconds or twenty seconds". In future frames, motion stereo can
tighten up this distance estimate (and correct the constant distance term).
By the time an object initially sighted twenty or more feet away comes to
within five feet of the pedestrian, the object will have appeared in thirty
frames of solid data, hopefully pinpointing its location to within a foot.

3. Once the effects of camera tilt and pan have been subtracted, the
differences in location of a feature in successive frames can be used to
determine its range and distance by working backwards from the projection
and camera transformation equations. We employ motion stereo as a
secondary distance cue that is used to check up on our primary cues 1 and
2 above.

4. There are some distance heuristics that are only of use for determining the
equation of the sidewalk's borders. These include making use of the known
constant width of the sidewalk.

5. Finally, the location of an image feature relative to the current
interpretation of the scene can be used to obtain a probable distance estimate.
For example, edges near the vanishing point of the sidewalk are probably (but
not necessarily) far away. Edges way off to one side and in the sky most
likely belong to upper stories of buildings or the background, and may be
safely ignored. (One misses overhangs this way, but overhangs in general
are very hard to recognize, as many are of very low contrast to begin with.)
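
Heuristics 1 and 2 above both amount to intersecting a viewing ray with a known
plane. The Python sketch below illustrates this under stated assumptions that
are not taken from the report: the camera sits at the origin with X to the
right, Y up, and Z forward; tilt and pan have already been removed; the image
point (x, y) lies on the ray with direction (x, y, focal_length); and the plane
is given as normal . P = offset. All function and parameter names are
hypothetical.

```python
def back_project_to_ground(x, y, focal_length, user_height):
    """Heuristic 1: intersect the viewing ray with the plane Y = -user_height."""
    if y >= 0:
        return None  # ray at or above the horizon never reaches the ground
    t = -user_height / y
    return (t * x, -user_height, t * focal_length)


def intersect_ray_with_plane(x, y, focal_length, normal, offset):
    """Heuristic 2: intersect the viewing ray with the plane normal . P = offset."""
    direction = (x, y, focal_length)
    denom = sum(n * d for n, d in zip(normal, direction))
    if abs(denom) < 1e-12:
        return None  # ray is parallel to the plane
    t = offset / denom
    if t <= 0:
        return None  # intersection lies behind the camera
    return tuple(t * d for d in direction)


# A ground-contact point seen below the horizon, eye height 5 ft.
print(back_project_to_ground(0.1, -0.25, 1.0, 5.0))

# A piece of a phone pole edge; the pole is assumed to stand in the vertical
# plane X = 2 (one foot in from a sidewalk edge running along X = 3).
print(intersect_ray_with_plane(0.1, 0.05, 1.0, (1.0, 0.0, 0.0), 2.0))
```

In both calls the recovered point lies about 20 feet ahead and 2 feet to the
right of the camera; only the X and Z components are needed to place the object
on the ground plane.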

X Semantic Analysis 

The semantic phase endeavors to build a three dimensional model of the
outside world that the system is moving through, such that edges in the image
frames correspond to edges of objects in the three dimensional model. The
various distance heuristics listed above are employed both to place objects
initially and to verify their location/identity over several frames. (An object
whose distance varies wildly from frame to frame may be misidentified.) Further
constraints exist that simplify the semantic analysis task. These are:

1. Most objects in the sidewalk world rest on the ground plane (though we do
not assume that their point of contact is visible).

2. Most objects can be roughly approximated by planes parallel to the x, y, or 
z axis (as in [4].) 

3. Location accuracy need only be enough to avoid objects most of the time.
For example, distances to objects need only be computed with an accuracy
of ±20% when objects are closer than 6 feet, and [illegible] when objects are
further away.

4. The camera transformation will be correctly supplied most of the time (by
hardware) to within 1% angular accuracy and 10% translation accuracy.

In order to speed the identification of edges in a new frame, predictions of
edge locations from previous frames are used. Much of the speed of the semantic
phase is due to the essentially hardware solution of the successive frame
registration problem. Most re-occurring edges can have their location in
successive frames determined to within a few pixels by using the camera
transformation. The overall effect is a sort of "boot-strapping"
re-identification of scene features, similar to that described in [13]. (It may
be possible that in the future we can dispense with the special camera motion
tracking hardware and recover this information incrementally from optical flow.)
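
The prediction step can be sketched in Python as follows. This is only an
illustration: the report does not give its camera transformation equations, so
the sketch assumes a pinhole camera at the origin (X right, Y up, Z forward)
whose motion between frames is a translation plus a pan about the vertical
axis. The function names are hypothetical.

```python
import math


def to_new_camera(point, translation, pan_radians=0.0):
    """Express a point from the old camera frame in the new camera frame.

    The camera first moves by `translation`, then pans right by `pan_radians`.
    """
    x = point[0] - translation[0]
    y = point[1] - translation[1]
    z = point[2] - translation[2]
    c, s = math.cos(pan_radians), math.sin(pan_radians)
    return (c * x - s * z, y, s * x + c * z)


def project(point, focal_length):
    """Pinhole projection onto the image plane; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (focal_length * x / z, focal_length * y / z)


def predict_edge(endpoints, translation, pan_radians, focal_length):
    """Predict where the two 3D endpoints of an edge appear in the next frame."""
    return [project(to_new_camera(p, translation, pan_radians), focal_length)
            for p in endpoints]


# An edge on the ground plane, with the camera moving 1.5 feet forward.
edge = [(2.0, -5.0, 20.0), (2.0, -5.0, 30.0)]
print(predict_edge(edge, (0.0, 0.0, 1.5), 0.0, 1.0))
```

The predicted image positions give the center of a small search window in the
new frame, which is what keeps the matching step inexpensive.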

In more detail, the semantic phase is broken up into several subparts.
These are:

1. Edge lines from the previous frame are first transformed by the camera
transformation and then matched against edge lines in the current frame.
Current lines that appear to be direct descendants of previous lines are
removed from the current data base as "explained". The order in which old
objects' edges are searched for is determined by their relative semantic
importance and data quality. Thus the edges of the sidewalk are usually
searched for first, followed by the other objects roughly ranked by their
number of (visible) edges.

2. The matches made in 1 induce new information that can be used to
construct an updated three dimensional model of the objects that the edges
belong to. These models can then make claims for gaps in their edge
outlines.

3. The claims made in 2, along with various generic claims for new objects, are
matched against the remaining data base of edge lines. Residual lines will
be claimed as background noise.

4. New object edges obtained in 3 allow for further updating of the three
dimensional models, which at this point can be used by the blind navigation
system. Predictions and search scheduling for the next frame are made at
this point. Objects whose existence is no longer supported by the edge line
evidence are deleted in favor of more consistent interpretations.
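
Step 1 above can be sketched as a priority-ordered greedy match. The Python
below is a minimal illustration, not the original procedure: the EdgeLine
fields, the priority numbers, and the distance and angle tolerances are all
assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeLine:
    """A summarized edge line (hypothetical fields): midpoint, angle, length."""
    x: float
    y: float
    angle: float   # degrees from the horizontal
    length: float  # pixels


def angle_diff(a, b):
    """Smallest difference between two undirected line angles, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def match_edges(predicted, current, max_dist=3.0, max_angle=10.0):
    """Match predicted edges to current edges, most important first.

    predicted: list of (priority, name, EdgeLine), higher priority searched first.
    current:   list of EdgeLine found in the new frame.
    Returns (matches, residual): matched lines by name, and unexplained lines.
    """
    remaining = list(current)
    matches = {}
    for _, name, pred in sorted(predicted, key=lambda item: -item[0]):
        best, best_dist = None, max_dist
        for cand in remaining:
            dist = math.hypot(cand.x - pred.x, cand.y - pred.y)
            if dist <= best_dist and angle_diff(cand.angle, pred.angle) <= max_angle:
                best, best_dist = cand, dist
        if best is not None:
            matches[name] = best
            remaining.remove(best)  # this line is now "explained"
    return matches, remaining


predicted = [
    (1, "pole_left", EdgeLine(50, 40, 90, 30)),
    (5, "sidewalk_left", EdgeLine(20, 60, 45, 80)),
]
current = [EdgeLine(21, 61, 47, 78), EdgeLine(51, 41, 88, 29), EdgeLine(90, 10, 0, 5)]
matches, residual = match_edges(predicted, current)
print(sorted(matches), residual)
```

The lines left in residual are the ones that steps 2 and 3 would then offer to
gap claims and to claims for new objects.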

Thus at any one point in time the world model data base of the system
contains models of several objects (phone poles, bushes, automobiles, etc.) that
are moving by, as well as a model of the sidewalk proper.

XI Experimental Results

Figure 1 is a sample image taken from an 8mm movie of a sidewalk. One of
our test sequences consists of 30 digitized images from this movie. The forward
motion between each frame was one and a half feet. On our half speed XL60,000,
passes 1 and 2 can process this movie at the rate of one second per frame. The
semantic processing of pass 3 takes an additional half second per frame.
When applied to this movie, the system correctly discovered and tracked the
sidewalk edges, as well as edges of several objects off to the right of the
sidewalk. No objects were found to be blocking the sidewalk. Figure 2 displays a
digitized image from the middle of this sequence with the wire-frame model of
pass 3 superimposed. (A similar fit is made for each frame of the movie.) A
computer animated reconstruction of the outside environment given the world
model produced by pass 3 is seen in Figure 3. (The detail on the parked car is
simulated.) Figures 4, 5 and 6 display three other digitized scenes from earlier
in the movie, with the wire-frame model produced by the system for that frame
superimposed. We expect to be running field trials of the entire system in a
portable cart trailing behind a blind subject shortly.

XII Incorporation into a Blind Aid 

The overall functioning of this system as a blind aid is part of the lineage of
a large number of previous tactile blind aid devices designed over the last
twenty years [14][15][16]. The computer vision component and the blind
interface component have been separated out from each other via the following
reasoning:

1. Assuming a perfect computer vision system that knows where every object 
of interest is located, how could one best communicate this information to 
a blind pedestrian? What sort of user interfaces and interactions will allow 
the user to make rapid, accurate use of the information provided? 


2. How does one build a portable (wearable!) computer vision system that will
locate at least the majority of the objects of interest?

Our solution to 2 has been presented in the previous portions of this paper.
Our solution to 1 is to use two output channels: stereo synthetic speech for
cognitive (high level) information, and a linear array of 16 tactile elements
as a pointing device. The stereo speech unit is a combination of a normal speech
synthesis system with an audio processing unit that can "throw" the computer's
speech, allowing it to appear to come from a particular direction and distance.
(For example, a phone pole might seem to announce "phone pole".) The tactile
array is a skin tapping device worn as a head-band; each element corresponds to
a particular angular direction, and the frequency of taps of an element
corresponds inversely to the distance of the feature being pointed to.
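
The tactile mapping just described might be sketched as follows in Python. Only
the 16 elements and the inverse relation between tap frequency and distance come
from the text; the 64 degree field of view, the 10 Hz ceiling, and the function
names are assumptions chosen for illustration.

```python
NUM_ELEMENTS = 16  # linear array of 16 tactile elements (from the text)


def element_for_bearing(bearing_deg, field_of_view_deg=64.0):
    """Index (0 = far left, 15 = far right) of the element for a bearing.

    bearing_deg is measured from straight ahead, positive to the right.
    """
    half = field_of_view_deg / 2.0
    clamped = max(-half, min(half, bearing_deg))
    fraction = (clamped + half) / field_of_view_deg
    return min(NUM_ELEMENTS - 1, int(fraction * NUM_ELEMENTS))


def tap_rate_hz(distance_ft, rate_at_one_foot=10.0, max_rate=10.0):
    """Tap frequency inversely proportional to distance, capped at max_rate."""
    if distance_ft <= 0:
        return max_rate
    return min(max_rate, rate_at_one_foot / distance_ft)


# A phone pole 12 degrees to the right and 5 feet away.
print(element_for_bearing(12.0), tap_rate_hz(5.0))
```

With these assumed constants, nearer obstacles tap faster and a feature dead
ahead falls on the element just right of the center of the band.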

Currently we intend to have the speech unit make only major announcements
(the blind don't want it babbling all the time; they listen to sound shadows and
street sounds). The tactile output will be used for communicating more mundane
information, such as "you're veering off to the left of the sidewalk, veer a
bit to the right", by "flashing" the edge of the sidewalk on the appropriate
side of the tactile display, should the user veer toward it. In any case, one of
the main reasons for putting the whole system on a microprocessor, rather than
simulating it on a mainframe, was to have the capability of experimenting in the
real world with various blind interface systems. Also, despite years of
experience in testing blind aids, it is very hard to tell how the blind will
react to a particular device without letting them make extensive use of it under
real world conditions.


XIII Perspectives on Future Directions

Within the hardware and timing constraints imposed, we feel that the
current system performs well, and cannot be much improved upon. However,
for use in a robust blind aid, the system has several limitations which must be
overcome. These include the low sensitivity to low contrast edges in shadowed
or complex scenes, and the low resolution (~1 degree/pixel). More important is
the limitation to processing of edge data only, to the exclusion of surface data.
It would be nice to have more information about the sidewalk than just where
its edges are, such as how flat it is and whether there are any broken sidewalk
slabs or holes. Also, the current system must treat any high contrast edge on
the sidewalk as a potential object edge, even though most are only flat shadows
or stains.

Various surface processing techniques proposed in the literature can solve
many of these limitations. Texture gradients [17] should provide a fairly robust
broad classification of the scene into flat and upright surfaces. Optical flow
can provide approximate distance estimates. Luminance gradients could indicate
surface curvature, which could be used in identifying (and segmenting) phone
poles, walls, automobiles, etc. [12]. Finally, stereo gradients can not only
determine general distance estimates, but for the special case of the almost
flat sidewalk, they may be tuned to spot vertical deviations as small as half an
inch (such as uneven sidewalk tiles that one might trip upon). More importantly,
stereo can determine that a dark patch is flat, and can be safely ignored. This
would be similar to the system described in [18]. However, for this specialized
stereo algorithm to work, it must have a very good estimate as to where the flat
sidewalk is in the first place. This is where the other surface processing
techniques enter the loop. Such a system could provide very robust performance
under even extreme conditions (such as wet (and reflective!) sidewalks in the
rain), but at the expense of special purpose hardware.
Currently we are in the initial stages of designing a surface processing
oriented version of our system along the lines described above, which is
to be implemented in VLSI. This system will be characterized by higher
resolution (with separate foveal and peripheral resolution), higher frame rates
(approaching 30 frames a second), stereo processing, and extensive use of
surface processing techniques. It will differ from other vision systems in that
it will be optimized for the sidewalk environment. For example, the stereo
section will not have to deal with the stereo frame registration problem in its
general form, but only with the much simpler case of extracting the (mostly
flat) ground plane. Evidence indicates that the vision system of many animals
(including man's) has a built in special case solution for extracting the ground
plane, which is similar to our proposed technique.


REFERENCES 

[1] Perkins, W. A. "A model based vision system for industrial parts." IEEE
Trans. Comput., vol. C-27 (1978) 126-143.

[2] Brooks, R., Greiner, R. and Binford, T. "The ACRONYM Model-Based Vision
System." In Proc. IJCAI-79, Tokyo, August, 1979, 105-113.

[3] Shapiro, L., Moriarty, J., Mulgaonkar, P. and Haralick, R. "Sticks, Plates,
and Blobs: A Three-Dimensional Object Representation for Scene Analysis."
In Proc. First NCAI, Stanford, CA, August, 1980, 28-30.

[4] Williams, T. "Depth from Camera Motion in a Real World Scene." IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. PAMI-2, no. 6 (1980) 511-516.

[5] Tsuji, S., Osada, M. and Yachida, M. "Tracking and Segmentation of Moving
Objects in Dynamic Line Images." IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. PAMI-2, no. 6 (1980) 516-522.

[6] Roach, J. and Aggarwal, J. "Determining the Three-Dimensional Motion and
Model of Objects from a Sequence of Images." University of Texas at Austin
Laboratory for Image and Signal Analysis research report TR-80-2.

[7] Nevatia, R. "Depth Measurement by Motion Stereo." Computer Graphics and
Image Processing, vol. 5 (1976) 203-214.

[8] Levin, M. "Analysis of Scenes from a Moving Viewpoint." In Artificial
Intelligence: An MIT Perspective, Winston, P. and Brown, R. eds., 167-206
(The MIT Press, Cambridge, Massachusetts, 1979).

[9] McKee, J. and Aggarwal, J. "Finding the edges of the surfaces of three-
dimensional curved objects by computer." Pattern Recognition, vol. 7
(1975) 25-52.

[10] Kovasznay, L. and Joseph, H. "Image processing." Proc. IRE, no. 43 (1955)
56-57.

[11] Prager, J. "Extracting and Labeling Boundary Segments in Natural Scenes."
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI-2, no. 1
(1980) 16-27.

[12] Tenenbaum, J. and Barrow, H. "Recovering Intrinsic Scene Characteristics
from Images." In Computer Vision Systems, Hanson, A. and Riseman, E.
eds., 3-26 (Academic Press, New York, NY, 1978).






[13] Hannah, M. "Bootstrap Stereo." In Proc. First NCAI, Stanford, CA, August,
1980, 38-40.

[14] Collins, C. "Tactile television: Mechanical and Electrical Image Projection."
IEEE Trans. on Man-Machine Systems, no. 11 (1970) 65-71.

[15] Collins, C. and Madey, J. "Tactile sensory replacement." Proc. San Diego
Biomedical Symposium, vol. 13 (1974) 15-26.

[16] Collins, C., Scadden, L. and Alden, A. "Mobility studies with a tactile imaging
device." Proc. Conf. on Systems and Devices for the Disabled, Seattle,
(1977) 170-174.

[17] Witkin, A. "A Statistical Technique for Recovering Surface Orientation from
Texture in Natural Imagery." In Proc. First NCAI, Stanford, CA, August,
1980, 1-3.

[18] Gennery, D. "A Stereo Vision System for an Autonomous Vehicle." In Proc.
IJCAI-77, Cambridge, MA, August, 1977, 578-582.




